Distributed Balanced Clustering via Mapping Coresets

Authors

  • MohammadHossein Bateni
  • Aditya Bhaskara
  • Silvio Lattanzi
  • Vahab S. Mirrokni
Abstract

Large-scale clustering of data points in metric spaces is an important problem in mining big data sets. Many applications impose explicit or implicit size constraints on each cluster, which leads to the problem of clustering under capacity constraints, or the “balanced clustering” problem. Although balanced clustering has been widely studied, developing a theoretically sound distributed algorithm remains an open problem. In this paper we develop a new framework based on “mapping coresets” to tackle this issue. Our technique yields the first distributed approximation algorithms for balanced clustering problems for a wide range of clustering objective functions such as k-center, k-median, and k-means.
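To make the capacity constraint concrete, here is a minimal Python sketch of balanced assignment under the k-center objective. It is not the paper's mapping-coreset algorithm: it is a simple sequential greedy heuristic with no approximation guarantee, and the names `balanced_assign`, `k_center_cost`, and `capacity` are hypothetical, chosen only for illustration.

```python
import math


def balanced_assign(points, centers, capacity):
    """Greedy illustration: assign every point to a center so that no
    center receives more than `capacity` points. Not the paper's method."""
    if capacity * len(centers) < len(points):
        raise ValueError("total capacity is smaller than the number of points")
    # Consider all (point, center) pairs from closest to farthest.
    pairs = sorted(
        (math.dist(p, c), i, j)
        for i, p in enumerate(points)
        for j, c in enumerate(centers)
    )
    assignment = [None] * len(points)
    load = [0] * len(centers)
    for _, i, j in pairs:
        # Take the pair only if the point is still free and the center has room.
        if assignment[i] is None and load[j] < capacity:
            assignment[i] = j
            load[j] += 1
    return assignment, load


def k_center_cost(points, centers, assignment):
    """k-center objective: the largest point-to-assigned-center distance."""
    return max(math.dist(p, centers[a]) for p, a in zip(points, assignment))


# Three points sit near the first center, but the capacity of 2 forces
# one of them to be served by the far center.
points = [(0, 0), (0, 1), (1, 0), (10, 10)]
centers = [(0, 0), (10, 10)]
assignment, load = balanced_assign(points, centers, capacity=2)
cost = k_center_cost(points, centers, assignment)
```

In this toy input the capacity constraint raises the k-center cost well above the unconstrained value, which is the tension that balanced clustering has to handle.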

Similar resources

Scalable and Distributed Clustering via Lightweight Coresets

Coresets are compact representations of data sets such that models trained on a coreset are provably competitive with models trained on the full data set. As such, they have been successfully used to scale up clustering models to massive data sets. While existing approaches generally only allow for multiplicative approximation errors, we propose a novel notion of coresets called lightweight cor...

On Coreset Constructions for the Fuzzy $K$-Means Problem

In this paper, we present coreset constructions for the fuzzy K-means problem. First, we show that one can construct weak coresets for fuzzy K-means. Second, we show that there are coresets for fuzzy K-means with respect to balanced fuzzy K-means solutions. Third, we use these coresets to develop a randomized approximation algorithm whose runtime is polynomial in the number of the given points...

Turning big data into tiny data: Constant-size coresets for k-means, PCA and projective clustering

We prove that the sum of the squared Euclidean distances from the n rows of an n×d matrix A to any compact set that is spanned by k vectors in R^d can be approximated up to a (1+ε)-factor, for an arbitrarily small ε > 0, using the O(k/ε^2)-rank approximation of A and a constant. This implies, for example, that the optimal k-means clustering of the rows of A is (1+ε)-approximated by an optimal k-means cl...

On the Sensitivity of Shape Fitting Problems

In this article, we study shape fitting problems, ε-coresets, and total sensitivity. We focus on the (j, k)-projective clustering problems, including k-median/k-means, k-line clustering, j-subspace approximation, and the integer (j, k)-projective clustering problem. We derive upper bounds of total sensitivities for these problems, and obtain ε-coresets using these upper bounds. Using a dimension-...

Strong Coresets for Hard and Soft Bregman Clustering with Applications to Exponential Family Mixtures

Coresets are efficient representations of data sets such that models trained on the coreset are provably competitive with models trained on the original data set. As such, they have been successfully used to scale up clustering models such as K-Means and Gaussian mixture models to massive data sets. However, until now, the algorithms and the corresponding theory were usually specific to each clus...

Publication date: 2014